Floating-Point Arithmetic

Docs for Reference
The problem of the Floating-Point
Rounding Error
- Floating-point Formats

Docs for Reference

What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg ./What Every Computer Scientist Should Know About Floating-Point Arithmetic.pdf
Single-precision floating-point format (Wikipedia) http://en.wikipedia.org/wiki/Single_precision
Double-precision floating-point format http://en.wikipedia.org/wiki/Double_precision
Floating point http://en.wikipedia.org/wiki/Floating_point

The problem of the Floating-Point

For example, the real number ‘.37’ cannot be written as a finite sum of powers of two, so it has no exact binary floating-point representation. If you assign it to a float, the value actually stored is the nearest representable number, which on a system using IEEE 754 single precision is about ‘0.3700000048’. This is easy to see with a simple program that prints a floating-point value to many decimal places.

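// Print a floating-point value to many decimal places
// to expose the representation error.
#include <stdio.h>

int main(void)
{
    float f = 0.37f;       // the f suffix makes this a float literal, not a double
    printf("%.20f\n", f);  // f is promoted to double when passed to printf
    return 0;
}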

More examples:
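One further illustration, offered as a minimal sketch rather than a definitive recipe: because 0.1, 0.2 and 0.3 are all rounded when stored, the sum 0.1 + 0.2 does not compare equal to 0.3 in double precision. The helper names sum_is_exact and nearly_equal are hypothetical, and the code assumes double is an IEEE 754 binary64 type.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when a + b, rounded to double, compares exactly equal
   to expected. The volatile store keeps the sum from being held in a
   wider intermediate format on platforms that use one. */
bool sum_is_exact(double a, double b, double expected)
{
    volatile double sum = a + b;
    return sum == expected;
}

/* Hypothetical helper: compares two values using an absolute tolerance
   chosen by the caller. A suitable tolerance depends on the magnitudes
   involved, so it is a parameter rather than a fixed constant. */
bool nearly_equal(double x, double y, double tol)
{
    assert(tol >= 0.0);
    double diff = x - y;
    if (diff < 0.0)
        diff = -diff;
    return diff <= tol;
}
```

Under these assumptions, sum_is_exact(0.1, 0.2, 0.3) is false, while sum_is_exact(0.5, 0.25, 0.75) is true because all three values are exact sums of powers of two. A tolerance-based comparison such as nearly_equal(0.1 + 0.2, 0.3, 1e-12) is the usual workaround.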

Rounding Error

Floating-point Formats

Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation.¹ Floating-point representations have a base β (which is always assumed to be even) and a precision p. If β = 2 and p = 24, then the decimal number 0.1 cannot be represented exactly, but is approximately 1.10011001100110011001101 × 2^{-4}.
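The binary expansion of 0.1 quoted above can be checked with a short sketch. It assumes that float is an IEEE 754 single-precision type (β = 2, p = 24, 8-bit exponent biased by 127) and that the input is a normalized number, not zero, subnormal, infinite or NaN. The function name float_significand_bits is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Writes the 24 significand digits of x as "d.ddd...d" into out
   (24 digits + '.' + terminating NUL = 26 chars) and returns the
   unbiased exponent e, so that x = significand × 2^e in magnitude.
   Assumes IEEE 754 single precision and a normalized x. */
int float_significand_bits(float x, char out[26])
{
    assert(sizeof(float) == sizeof(uint32_t));

    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the bytes safely */

    int exponent = (int)((bits >> 23) & 0xFFu) - 127;        /* remove the bias */
    uint32_t significand = (bits & 0x7FFFFFu) | 0x800000u;   /* restore hidden leading 1 */

    int pos = 0;
    for (int i = 23; i >= 0; --i) {
        out[pos++] = ((significand >> i) & 1u) ? '1' : '0';
        if (i == 23)
            out[pos++] = '.';         /* binary point after the leading digit */
    }
    out[pos] = '\0';
    return exponent;
}
```

Called with 0.1f, this fills the buffer with 1.10011001100110011001101 and returns -4, matching the approximation given in the text.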

In general, a floating-point number will be represented as ± d.dd…d × β^e, where d.dd…d is called the significand and has p digits. More precisely, ± d₀.d₁d₂…d_{p-1} × β^e represents the number

± (d₀ + d₁β^{-1} + … + d_{p-1}β^{-(p-1)}) β^e, (0 ≤ d_i < β)
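The digit expansion can be evaluated directly, which may help make the notation concrete. The following is a minimal sketch: the function name fp_value and its parameter layout are hypothetical, and the result is computed in double, so it is exact only when the represented number itself fits in a double (as it does for β = 2, p = 24).

```c
#include <assert.h>

/* Evaluates ±(d0 + d1*beta^-1 + ... + d_{p-1}*beta^-(p-1)) * beta^e.
   digits[0] is d0, digits[p-1] is d_{p-1}; sign < 0 selects the
   negative value. Each digit must satisfy 0 <= d_i < beta. */
double fp_value(int sign, const int *digits, int p, int beta, int e)
{
    assert(p > 0 && beta >= 2);

    /* Horner's rule, starting from the least significant digit:
       after the loop, sum = d0 + d1/beta + ... + d_{p-1}/beta^(p-1). */
    double sum = 0.0;
    for (int i = p - 1; i >= 0; --i) {
        assert(digits[i] >= 0 && digits[i] < beta);
        sum = sum / beta + digits[i];
    }

    /* Scale by beta^e using repeated multiplication. */
    int steps = e < 0 ? -e : e;
    double scale = 1.0;
    for (int k = 0; k < steps; ++k)
        scale *= beta;
    sum = e < 0 ? sum / scale : sum * scale;

    return sign < 0 ? -sum : sum;
}
```

For instance, passing the 24 binary digits 1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,1 with beta = 2 and e = -4 yields the same value as (double)0.1f on an IEEE 754 system, that is, the stored approximation of 0.1 rather than 0.1 itself.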

Footnotes:

¹ Examples of other representations are floating slash and signed logarithm [Matula and Kornerup 1985; Swartzlander and Alexopoulos 1975].